ENH: Optimize nrows in read_excel #35974
pandas/io/excel/_base.py
Outdated
@@ -453,7 +491,20 @@ def parse(
else:  # assume an integer if not a string
    sheet = self.get_sheet_by_index(asheetname)

data = self.get_sheet_data(sheet, convert_float)
get_sheet_data_header = 0 if header is None else header
Is this required? Looks like there are already checks within the subsequent functions for int values, no?
pandas/io/excel/_xlrd.py
Outdated
for i in range(sheet_nrows):
    if self.should_skip_row(i, header, skiprows, nrows):
        data.append([])
Do we need to append the empty list here? It would be preferable to just continue.
I removed should_skip_row entirely. I think the benefit from the original PR came from not reading all the rows into memory, rather than from optimising skipping rows.
Performance benchmarks:
The OP mentioned a memory leak from skipping rows in openpyxl, which I experienced too, so I've skipped optimising those files for now.
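To illustrate the point about not reading all the rows into memory, here is a minimal sketch rather than the code from this PR. `take_rows`, `max_rows`, and `fake_sheet` are hypothetical names; the only assumption is that the reader can expose the sheet as a lazy row iterator.

```python
from itertools import islice


def take_rows(row_iter, max_rows=None):
    """Materialise at most `max_rows` rows from a lazy row iterator.

    Hypothetical helper for illustration only. With max_rows=None every
    row is read, which corresponds to the unoptimised path.
    """
    if max_rows is None:
        return [list(row) for row in row_iter]
    # islice stops pulling from the iterator once the limit is reached,
    # so rows past the limit are never loaded into memory.
    return [list(row) for row in islice(row_iter, max_rows)]


def fake_sheet(n_rows, counter):
    """Stand-in for a sheet reader; counts how many rows were produced."""
    for i in range(n_rows):
        counter["read"] += 1
        yield (i, f"value-{i}")


counter = {"read": 0}
rows = take_rows(fake_sheet(1_000_000, counter), max_rows=5)
print(len(rows), counter["read"])
```

In this sketch only the requested rows are ever produced by the generator, so the saving comes from the early stop, not from any per-row skip check.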
Looks good. I thought this would make a much bigger difference, but OK. cc @WillAyd
Thanks @MarcoGorelli
elif skiprows is None:
    skiprows_nrows = 0
else:
    skiprows_nrows = len(skiprows)
This looks like it's causing failures on master.
@jbrockmendel can you revert for now and we can reopen
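For readers following the `skiprows_nrows` hunk in this thread, here is a hedged sketch of how an upper bound on the rows to read might be computed. It is not the reverted code and not a diagnosis of the failure; `max_rows_to_read` is a hypothetical name. It assumes `header` is `None`, an int, or a non-empty list of ints, and that `skiprows` is `None`, an int, a list-like of row indices, or a callable (the forms `read_excel` documents). It ignores every other parser option, and it returns `None` whenever no safe bound is known.

```python
def max_rows_to_read(header, skiprows, nrows):
    """Upper bound on sheet rows needed to return `nrows` data rows.

    Hypothetical helper. Returns None when no safe bound is known,
    meaning the caller should fall back to reading the whole sheet.
    """
    # A callable skiprows has no length, so no bound can be derived from it.
    if nrows is None or callable(skiprows):
        return None

    if header is None:
        header_rows = 0
    elif isinstance(header, int):
        header_rows = header + 1  # rows 0..header inclusive
    else:
        header_rows = max(header) + 1  # multi-row header

    if skiprows is None:
        skipped = 0
    elif isinstance(skiprows, int):
        skipped = skiprows  # skip the first `skiprows` rows
    else:
        skipped = len(skiprows)  # list-like of row indices to skip

    return header_rows + skipped + nrows


print(max_rows_to_read(header=0, skiprows=2, nrows=3))
print(max_rows_to_read(header=0, skiprows=lambda i: i % 2 == 0, nrows=3))
```

The bound is deliberately conservative: for a list-like `skiprows` it counts every listed index even if some fall beyond the rows actually needed.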
This reverts commit e975f3d.
@MarcoGorelli we are reverting this, as something is broken in the evaluation here. Please resubmit when you can.
Sure - I'm really sorry for the breakage caused, but glad this was caught early!
…-dev#36537) This reverts commit e975f3d.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
based on #33281
output of asv benchmarks: